Researchers were studying the next generation RNAseq from human hepatic sinusoidal endothelial cells (EC) and mouse sinusoidal EC plated on soft vs hard gels.
Findings from the next generation RNAseq data from human hepatic sinusoidal endothelial cells plated on soft vs hard gels reveal the chemokine CXCL1 being among the top up-regulated genes from granulocyte and agranulocyte migration/diapedesis pathway.
The expression profiling was done by high throughput sequencing.
# Importing the Dataset
GEOid="GSE198462"
setwd(GEOid)
gse<-getGEO("GSE198462",GSEMatrix =TRUE)[[1]]
se = as(gse, "SummarizedExperiment")
metadata(se)
se@metadata
se@colData
if("GSE198462" %in% list.files()){"you already have it"}else{getGEOSuppFiles("GSE198462")}
# Creating a Single Dataframe
df1<-read.table("GSE198462_GeneCount_human.tsv.gz", skip=0, sep = "\t", header=TRUE)
df1<-subset(df1, select = -c(1,4,5,6)) #Removing unwanted columns
df1$GeneName[duplicated(df1$GeneName)] #Checking for duplicate genes
length(unique(df1$GeneName)) #Unique
df1%>%group_by(GeneName)%>%summarize_if(is.numeric,sum)%>%as.data.frame->df1a
df1a%>%head
df1b<-df1a[,-1] # New data frame, without GeneName column
rownames(df1b)<-df1a[,1] # Setting GeneName column values as rownames in df1b
df1b%>%head
colnames(df1b)<-c("SOFT_6h_1","SOFT_6h_2","SOFT_6h_3", "HARD_6h_1","HARD_6h_2", "HARD_6h_3", "SOFT_12h_1", "SOFT_12h_2","SOFT_12h_3","HARD_12h_1","HARD_12h_2","HARD_12h_3") # Shortening names - removing "Xnm_HHSEC"
se@colData%>%rownames
rownames(se@colData)<-colnames(df1b) # Setting column names from df1b as rownames of se@colData
# Our groups
se@colData$treatment.ch1
se@colData$treatment.ch1<-factor(str_replace_all(se@colData$treatment.ch1, "[^A-Za-z0-9_]", "."))
se@colData$characteristics_ch1.1
se@colData$characteristics_ch1.1<-factor(str_replace_all(se@colData$characteristics_ch1.1, "[^A-Za-z0-9_]", "."))
# Creating a Big Dataset
ddsfresh <- DESeqDataSetFromMatrix(countData = df1b,
colData = se@colData,
design = ~ characteristics_ch1.1,
metadata = metadata(se))df1b %>% head(5) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))| SOFT_6h_1 | SOFT_6h_2 | SOFT_6h_3 | HARD_6h_1 | HARD_6h_2 | HARD_6h_3 | SOFT_12h_1 | SOFT_12h_2 | SOFT_12h_3 | HARD_12h_1 | HARD_12h_2 | HARD_12h_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5_8S_rRNA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5S_rRNA | 11 | 7 | 8 | 4 | 8 | 12 | 8 | 16 | 13 | 9 | 15 | 16 |
| 7SK | 6 | 5 | 2 | 4 | 14 | 7 | 17 | 7 | 4 | 0 | 8 | 6 |
| A1BG | 135 | 158 | 224 | 191 | 235 | 183 | 246 | 235 | 120 | 181 | 246 | 197 |
| A1BG-AS1 | 152 | 199 | 254 | 250 | 294 | 256 | 248 | 251 | 161 | 220 | 353 | 262 |
The package DESeq2 has two methods for transforming the data into a form which is more suitable for visualization of data - variance-stabilizing transformation (VST) and regularized logarithm (rlog).
For genes with high counts, both the VST and the rlog will give similar result to the ordinary log2 transformation of normalized counts. For genes with lower counts, however, the values are shrunken towards a middle value.
Of the two previously mentioned functions, VST is faster to compute in comparison with rlog. It is less sensitive to high count outliers. The rlog function tends to work well on small datasets (n < 30), potentially outperforming the VST when there is a wide range of sequencing depth across samples (an order of magnitude difference).
Since we saw that some values are 0, we want to remove any such values from the data set, with filtering, we can obtain such datasets. Below we have the number of rows before filtering.
dds <- ddsfresh
nrow(dds)[1] 58674
The number of rows after filtering is the following:
keep <- rowSums(counts(dds)) > 1 # Here, we are removing 0 values
dds <- dds[keep,]
nrow(dds)[1] 37745
The new dataset will be used for the rest of the analysis.
This is a wrapper for the varianceStabilizingTransformation (VST) that provides much faster estimation of the dispersion trend used to determine the formula for the VST. The speed-up is accomplished by subsetting to a smaller number of genes in order to estimate this dispersion trend. The subset of genes is chosen deterministically, to span the range of genes’ mean normalized count.
As a result we can see transformed (variance stabilized) expression values.
vsd <- vst(dds, blind = FALSE)
assay(vsd) %>% head(3) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, latex_options = "scale_down")| SOFT_6h_1 | SOFT_6h_2 | SOFT_6h_3 | HARD_6h_1 | HARD_6h_2 | HARD_6h_3 | SOFT_12h_1 | SOFT_12h_2 | SOFT_12h_3 | HARD_12h_1 | HARD_12h_2 | HARD_12h_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5S_rRNA | 6.862699 | 6.696255 | 6.710668 | 6.587076 | 6.675858 | 6.788293 | 6.659977 | 6.884352 | 6.863882 | 6.713852 | 6.789576 | 6.845675 |
| 7SK | 6.698882 | 6.624494 | 6.471004 | 6.587076 | 6.818186 | 6.657427 | 6.854379 | 6.664776 | 6.583422 | 6.229673 | 6.639757 | 6.608676 |
| A1BG | 8.286483 | 8.279283 | 8.533839 | 8.472824 | 8.436159 | 8.254482 | 8.410909 | 8.514039 | 8.049102 | 8.243641 | 8.322872 | 8.240791 |
This function transforms the count data to the log2 scale in a way which minimizes differences between samples for rows with small counts, and which normalizes with respect to library size.
rld <- rlog(dds, blind = FALSE)
assay(rld) %>% head(3) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, latex_options = "scale_down")| SOFT_6h_1 | SOFT_6h_2 | SOFT_6h_3 | HARD_6h_1 | HARD_6h_2 | HARD_6h_3 | SOFT_12h_1 | SOFT_12h_2 | SOFT_12h_3 | HARD_12h_1 | HARD_12h_2 | HARD_12h_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5S_rRNA | 3.415375 | 3.348201 | 3.353086 | 3.313299 | 3.339020 | 3.384764 | 3.332560 | 3.429407 | 3.417847 | 3.353777 | 3.385889 | 3.411788 |
| 7SK | 2.682985 | 2.658615 | 2.618376 | 2.647543 | 2.735676 | 2.669455 | 2.754229 | 2.672010 | 2.646393 | 2.591721 | 2.663206 | 2.652878 |
| A1BG | 7.536255 | 7.529819 | 7.733617 | 7.683650 | 7.656056 | 7.509064 | 7.635876 | 7.718450 | 7.349157 | 7.500328 | 7.563995 | 7.497437 |
In previous functions, blind was set to FALSE, which means that differences between cell lines and treatment will not contribute to the expected variance-mean trend of the experiment.
In the following plot are shown scatterplots using the log2 transform of normalized counts (left), using the VST (middle), and using the rlog (right). We added 1 in the log2 function to avoid taking the log of zero.
dds2 <- estimateSizeFactors(dds)
df <- bind_rows(
as.data.frame(log2(counts(dds2, normalized=TRUE)[, 1:2]+1)) %>%
mutate(transformation = "log2(x + 1)"),
as.data.frame(assay(vsd)[, 1:2]) %>% mutate(transformation = "vst"),
as.data.frame(assay(rld)[, 1:2]) %>% mutate(transformation = "rlog"))
colnames(df)[1:2] <- c("x", "y")
lvls <- c("log2(x + 1)", "vst", "rlog")
df$transformation <- factor(df$transformation, levels=lvls)
ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) +
coord_fixed() + facet_grid( . ~ transformation) +
ggtitle("Results of Transformation") +
theme(plot.title = element_text(hjust = 0.5))
While the rlog is on roughly the same scale as the log2 counts, the VST
has a upward shift for the smaller values.
The first step is using the function dist to calculate the Euclidean distance between samples. To ensure we have a roughly equal contribution from all genes, we use it on the VST data. We need to transpose the matrix of values using t, because the dist function expects the different samples to be rows of its argument, and different dimensions (here, genes) to be columns.
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists) # So that we have symmetric values inserted in the matrix
rownames(sampleDistMatrix) <- paste(vsd$genotype.ch1,vsd$characteristics_ch1.1, sep = " - " )
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(8, "Blues")) )(225)
pheatmap(sampleDistMatrix,
clustering_distance_rows = sampleDists,
clustering_distance_cols = sampleDists,
col = colors, main = "Euclidean Distance Between Samples")The distance between the groups is lower than the distance between the group members. Therefore, we observe four small clusters, which are clustering near the main diagonal. That indicates that, indeed, the groupings are occurring between group members in dependence of the treatment. Lighter color squares (lilac) are indicating that the distance between different groups is high, as mentioned, and the purple color squares are indicating smaller distances between the members of the same group. Of course, the distance between the same sample is 0, hence the color is the darkest.
Another way to visualize sample-to-sample distances is a principal components analysis (PCA), as well as the multidimensional scaling (MDS) function. Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset.
This is useful when we don’t have a matrix of data, but only a matrix of distances. Here we compute the MDS for the distances calculated from the VST data. Below we can see the plots.
In this method, the data points are projected onto the 2D plane such that they spread out in the two directions that explain most of the differences (figure below). The x-axis is the direction that separates the data points the most. The values of the samples in this direction are written PC1. The y-axis is a direction (it must be orthogonal - perpendicular - to the first direction) that separates the data the second most. The values of the samples in this direction are written PC2. The percent of the total variance that is contained in the direction is printed in the axis label. These percentages do not add to 100%, because there are more dimensions that contain the remaining variance.
Each unique combination of treatment and cell line is given its own color, so we have two different.
plotPCA(vsd, intgroup = c("treatment.ch1","characteristics_ch1.1"))Another plot, multidimensional scaling (MDS), is useful when we don’t have a matrix of data, but only a matrix of distances. Here we compute the MDS for the distances calculated from the rlog transformed counts and plot these in a figure below.
This method, unlike PCA, does not show the percent of variance explained by the individual components.
mds <- as.data.frame(colData(vsd)) %>%
cbind(cmdscale(sampleDistMatrix))
ggplot(mds, aes(x = `1`, y = `2`, color = treatment.ch1, shape = characteristics_ch1.1)) +
geom_point(size = 3) + coord_fixed() + ggtitle("Multidimensional Scaling with VST data")We can still see the clustering pattern as in the heatmap, in both PCA and MDS plots, as expected. Two groups are clustering on the opposite sides of the x-axis. The peach color is, as it is written in the legend, for the samples plated on hard gel, which corresponds to the circle shape, and the blue color, with the triangle shape is used to present samples plated on soft gel. We can see relation between PCA and MDS plots.
dds<-DESeq(dds)
res<-results(dds)
res<-results(dds, contrast=c("characteristics_ch1.1", "treatment..Plated.on.Soft.get", "treatment..Plated.on.Hard.gel"))
#mcols(res, use.names = TRUE)
summary(res)
out of 37745 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 1737, 4.6%
LFC < 0 (down) : 2075, 5.5%
outliers [1] : 10, 0.026%
low counts [2] : 18287, 48%
(mean count < 11)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
Using R functions, we can lower the false discovery rate threshold. If we use the threshold as 0.05, the results can be seen as in the table below.
If we want to raise the log2 fold change threshold, so that we test for genes that show more substantial changes due to treatment, we simply supply a value on the log2 scale. For example, by specifying lfcThreshold = 1, we test for genes that show significant effects of treatment on gene counts more than doubling or less than halving.
res.05 <- results(dds, alpha = 0.05)
table(res.05$padj < 0.05) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))| Var1 | Freq |
|---|---|
| FALSE | 16566 |
| TRUE | 2882 |
resLFC1 <- results(dds, lfcThreshold=1)
table(resLFC1$padj < 0.1) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))| Var1 | Freq |
|---|---|
| FALSE | 37729 |
| TRUE | 6 |
In high-throughput biology, we are careful to not use the p values directly as evidence against the null, but to correct for multiple testing. There are 5546 genes with a p value below 0.05 among the 37735 genes for which the test succeeded in reporting a p value.
If we consider a fraction of 10% false positives acceptable, we can consider all genes with an adjusted p value below 10% = 0.1 as significant: 3812
We subset the results table to these genes and then sort it by the log2 fold change estimate to get the significant genes with the strongest down-regulation:
resSig <- subset(res, padj < 0.1)
head(resSig[ order(resSig$log2FoldChange), ]) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))| baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | |
|---|---|---|---|---|---|---|
| ACPT | 12.27100 | -2.970560 | 0.7645526 | -3.885358 | 0.0001022 | 0.0018115 |
| KRT34 | 19.29739 | -2.958338 | 0.4987749 | -5.931209 | 0.0000000 | 0.0000003 |
| ANKRD1 | 1732.66358 | -2.822059 | 0.1640594 | -17.201451 | 0.0000000 | 0.0000000 |
| MYOCD | 162.19281 | -2.807831 | 0.2958987 | -9.489164 | 0.0000000 | 0.0000000 |
| RP11-1090M7.1 | 19.36439 | -2.743918 | 0.4975424 | -5.514943 | 0.0000000 | 0.0000021 |
| MT1G | 18.56607 | -2.638828 | 0.7040862 | -3.747876 | 0.0001783 | 0.0028198 |
…and with the strongest up-regulation:
head(resSig[ order(resSig$log2FoldChange, decreasing = TRUE), ]) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))| baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | |
|---|---|---|---|---|---|---|
| BCAN | 12.07654 | 1.681754 | 0.5027500 | 3.345111 | 0.0008225 | 0.0095918 |
| H19_2 | 12.64732 | 1.640585 | 0.4659317 | 3.521086 | 0.0004298 | 0.0057884 |
| FOXS1 | 18.80486 | 1.600116 | 0.3262747 | 4.904199 | 0.0000009 | 0.0000373 |
| PPARGC1A | 12.96573 | 1.546851 | 0.4797272 | 3.224440 | 0.0012622 | 0.0132573 |
| AKR1B10 | 79.06716 | 1.483108 | 0.3105914 | 4.775111 | 0.0000018 | 0.0000639 |
| RP11-331F4.4 | 12.32292 | 1.432835 | 0.4241295 | 3.378296 | 0.0007294 | 0.0087665 |
From the description of the function, we know that normalized counts plus a pseudocount of 0.5 are shown by default. This is a quick way to visualize the counts for a particular gene. Here, that particular gene is AMIGO2.
topGene <- rownames(res)[which.min(res$padj)]
plotCounts(dds, gene = topGene, intgroup=c("treatment.ch1"))An MA-plot provides a useful overview for the distribution of the estimated coefficients in the model, e.g. the comparisons of interest, across all genes. On the y-axis, the “M” stands for “minus” – subtraction of log values is equivalent to the log of the ratio – and on the x-axis, the “A” stands for “average”. You may hear this plot also referred to as a mean-difference plot, or a Bland-Altman plot.
The DESeq2 package uses a Bayesian procedure to moderate (or “shrink”) log2 fold changes from genes with very low counts and highly variable counts, as can be seen by the narrowing of the vertical spread of points on the left side of the MA-plot, which we can see below. The lfcShrink function performs this operation.
We wanted to shrink the noisy log2 fold changes and how this difference effects the data, so we applied statistical moderation. Therefore we included two different plots; with statistical moderation and without statistical moderation. We can also detect the individual points in our graph.
The grey dots represent non significant genes for our dataset. We can see how different our samples are in terms of read counts. The data points converge to zero at Y-axis because log (A/A) is zero.
Another useful diagnostic plot is the histogram of the p values. This plot is best formed by excluding genes with very small counts, which otherwise generate spikes in the histogram.
library("apeglm")
# resultsNames(dds)
res <- lfcShrink(dds, coef="characteristics_ch1.1_treatment..Plated.on.Soft.get_vs_treatment..Plated.on.Hard.gel", type="apeglm")
plotMA(res, ylim = c(-5, 5))plotMA(res, ylim = c(-5,5))
topGene <- rownames(res)[which.min(res$padj)]
with(res[topGene, ], {
points(baseMean, log2FoldChange, col="green", cex=2, lwd=2)
text(baseMean, log2FoldChange, topGene, pos=2, col="green")
})res.noshr <- results(dds, name="characteristics_ch1.1_treatment..Plated.on.Soft.get_vs_treatment..Plated.on.Hard.gel")
plotMA(res.noshr, ylim = c(-5, 5))hist(res$pvalue[res$baseMean > 1], breaks = 0:20/20,
col = "#CDC8B1", border = "#8B8878", main= "Histogram of Frequency", xlab = "Base mean > 1")As we can see, there is only one outlier, with a really high frequency, but the rest has mostly similar frequencies.
library("genefilter")
topVarGenes <- head(order(rowVars(assay(vsd)), decreasing = TRUE), 20)
mat <- assay(vsd)[ topVarGenes, ]
mat <- mat - rowMeans(mat)
anno <- as.data.frame(colData(vsd)[, c("characteristics_ch1.1","treatment.ch1")])
pheatmap(mat, annotation_col = anno, fontsize_row = 5)In the previous plot, we can see the heatmap of relative VST-transformed values across samples. In the heatmap, the dendrogram at the side shows us a hierarchical clustering of the samples. Such clustering can also be performed for the genes. Since the clustering is only relevant for genes that carry a signal, one usually would only cluster a subset of the most highly variable genes.
The result table for the dataset we chose contains the Ensembl, Alias and Symbol gene IDs. Bioconductor’s annotation packages help with mapping various ID schemes to each other.
We are using genome wide annotation for Human, primarily based on mapping using Entrez Gene identifiers.
columns(org.Hs.eg.db) [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
[6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
[16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
[21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
[26] "UNIPROT"
The mapIds function was used to add individual columns to our results table, which will be used further in the process.
The function select returns a data frame, while mapIds returns a vector.
#compare with your GENE ID and find the match
rownames(counts(dds))%>%head[1] "5S_rRNA" "7SK" "A1BG" "A1BG-AS1" "A1CF" "A2M"
database<-org.Hs.eg.db
columns(database) [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
[6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
[16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
[21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
[26] "UNIPROT"
# Functions: mapIds, select
#mapIds returns a vector
dfannotation1<-AnnotationDbi::select(x=database,
keys = rownames(counts(dds)),
column = c("ENSEMBL", "ENTREZID"),
keytype = "SYMBOL",
multiVals = "first")
dfannotation1%>%head(5) SYMBOL ENSEMBL ENTREZID
1 5S_rRNA <NA> <NA>
2 7SK <NA> <NA>
3 A1BG ENSG00000121410 1
4 A1BG-AS1 ENSG00000268895 503538
5 A1CF ENSG00000148584 29974
dfannotation2<-mapIds(x=database,
keys = rownames(counts(dds)),
column = "ENSEMBL",
keytype = "ALIAS",
multiVals = "first")
dfannotation2%>%head(5) 5S_rRNA 7SK A1BG A1BG-AS1
NA "ENSG00000202198" "ENSG00000121410" "ENSG00000268895"
A1CF
"ENSG00000148584"
duplicates<-dfannotation2[duplicated(dfannotation2)]saveRDS(ddsfresh, file = "dds.rds")
# Restore the object
ddsN<-readRDS(file = "dds.rds")
df2<-as.data.frame(counts(ddsN))
colData<-colData(ddsN)
metadata<-metadata(ddsN)
database<-org.Hs.eg.db
symbols <- AnnotationDbi::select(database, keys = rownames(df2),
column = c("SYMBOL"), keytype = "ALIAS")
#df2joined<-cbind(df2,symbols) #may not work properly if order is switched at some points
df2%>%mutate(RowNames=rownames(.))%>%right_join(symbols,by=c("RowNames"="ALIAS"))%>%filter(!is.na(SYMBOL))->dfjoined
dfjoined%>%select_if(is.numeric)->dfannotated
unique.col1 = make.names(dfjoined[,"SYMBOL"], unique=T)
rownames(dfannotated)<-unique.col1
symbols <- AnnotationDbi::select(database, keys = rownames(df2),
column = "ENSEMBL", keytype = "SYMBOL")
dfjoined %>% head(5) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), )| SOFT_6h_1 | SOFT_6h_2 | SOFT_6h_3 | HARD_6h_1 | HARD_6h_2 | HARD_6h_3 | SOFT_12h_1 | SOFT_12h_2 | SOFT_12h_3 | HARD_12h_1 | HARD_12h_2 | HARD_12h_3 | RowNames | SYMBOL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 5 | 2 | 4 | 14 | 7 | 17 | 7 | 4 | 0 | 8 | 6 | 7SK | RN7SK |
| 135 | 158 | 224 | 191 | 235 | 183 | 246 | 235 | 120 | 181 | 246 | 197 | A1BG | A1BG |
| 152 | 199 | 254 | 250 | 294 | 256 | 248 | 251 | 161 | 220 | 353 | 262 | A1BG-AS1 | A1BG-AS1 |
| 15 | 9 | 4 | 0 | 0 | 4 | 2 | 6 | 8 | 4 | 6 | 4 | A1CF | A1CF |
| 2 | 0 | 11 | 6 | 6 | 6 | 8 | 20 | 12 | 8 | 6 | 20 | A2M | A2M |
Volcano Plot is good for identifying changes in large data sets composed of replicate data. It combines a measure of statistical significance from a statistical test with the magnitude of the change, enabling quick visual identification of those data-points like genes. This way we can plot significance versus fold-change on the y and x axes, respectively.
dfjoined%>%select_if(is.numeric)->dfannotated
unique.col1 = make.names(dfjoined[,"SYMBOL"], unique=T)
rownames(dfannotated)<-unique.col1
dds <- DESeq(dds, betaPrior=FALSE)
dds$characteristics_ch1 <- as.factor(dds$characteristics_ch1)
dds$characteristics_ch1.1 <- as.factor(dds$characteristics_ch1.1)
EnhancedVolcano(res,
lab = rownames(res),
x = 'log2FoldChange',
y = 'pvalue')Bar plot is the most widely used method to visualize enriched terms. A barplot is used to display the relationship between a numeric and a categorical variable. It depicts the enrichment scores and gene count or ratio as bar height and color.
database<-org.Hs.eg.db #org.Mm.eg.db, org.Hs.eg.db
dds<-readRDS(file = "dds.rds")
#from DiffExprAnal we found these genes to be differentially expressed.
ddsGO <- DESeq(dds)
resGO <- results(ddsGO, contrast=c("characteristics_ch1.1","treatment..Plated.on.Soft.get","treatment..Plated.on.Hard.gel"))
resGO<-resGO[!is.na(resGO$padj),]
SigExpGenes<-resGO[(resGO$padj < 0.05) & (abs(resGO$log2FoldChange)>1),]
ggo1 <- groupGO(gene = rownames(SigExpGenes),
OrgDb = database,
keyType = "SYMBOL",
ont = "BP",
level = 1,
readable = F)
ggo2 <- groupGO(gene = rownames(SigExpGenes),
OrgDb = database,
keyType = "SYMBOL",
ont = "BP",
level = 2,
readable = F)
ego <- enrichGO(gene = rownames(SigExpGenes),
universe = rownames(counts(ddsGO)),
OrgDb = org.Hs.eg.db,
keyType = 'SYMBOL',
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
readable = F)
dfego<-as.data.frame(ego[,-8])
dfego[dfego$p.adjust<ego@pvalueCutoff & dfego$pvalue<ego@pvalueCutoff & dfego$qvalue<ego@qvalueCutoff,2] [1] "response to lipopolysaccharide"
[2] "cellular response to fatty acid"
[3] "response to molecule of bacterial origin"
[4] "regulation of smooth muscle cell proliferation"
[5] "smooth muscle cell proliferation"
[6] "response to decreased oxygen levels"
[7] "response to oxygen levels"
[8] "epithelial cell proliferation"
[9] "response to fatty acid"
[10] "response to hypoxia"
[11] "positive regulation of MAP kinase activity"
[12] "muscle cell proliferation"
[13] "vascular process in circulatory system"
[14] "regulation of cell junction assembly"
[15] "negative regulation of viral process"
[16] "body fluid secretion"
[17] "response to mechanical stimulus"
[18] "cellular response to hypoxia"
[19] "regulation of epithelial cell proliferation"
[20] "myeloid leukocyte activation"
[21] "cellular response to decreased oxygen levels"
[22] "negative regulation of viral genome replication"
[23] "positive regulation of cell junction assembly"
[24] "viral life cycle"
[25] "positive regulation of smooth muscle cell proliferation"
[26] "response to muscle stretch"
[27] "positive regulation of transforming growth factor beta production"
[28] "regulation of signaling receptor activity"
[29] "cellular response to oxygen levels"
[30] "response to tumor necrosis factor"
[31] "regulation of MAP kinase activity"
[32] "response to transforming growth factor beta"
[33] "negative regulation of cytokine production"
[34] "regulation of protein serine/threonine kinase activity"
[35] "cellular response to metal ion"
[36] "response to ketone"
[37] "response to prostaglandin"
[38] "positive regulation of protein serine/threonine kinase activity"
[39] "positive regulation of smooth muscle contraction"
[40] "positive regulation of kinase activity"
[41] "positive regulation of cytokine production"
[42] "positive regulation of secretion by cell"
[43] "negative regulation of smooth muscle cell proliferation"
[44] "mammary gland development"
[45] "peripheral nervous system development"
[46] "positive regulation of animal organ morphogenesis"
[47] "myeloid leukocyte differentiation"
[48] "cellular response to lipopolysaccharide"
[49] "cellular divalent inorganic cation homeostasis"
[50] "positive regulation of protein kinase activity"
[51] "positive regulation of T cell activation"
[52] "positive regulation of leukocyte activation"
[53] "regulation of viral genome replication"
[54] "monocyte differentiation"
[55] "regulation of viral life cycle"
[56] "response to reactive oxygen species"
[57] "cellular response to molecule of bacterial origin"
[58] "negative regulation of striated muscle cell differentiation"
[59] "cellular response to inorganic substance"
[60] "vasoconstriction"
[61] "positive regulation of cell activation"
[62] "regulation of hormone levels"
[63] "divalent inorganic cation homeostasis"
[64] "cellular response to tumor necrosis factor"
[65] "positive regulation of secretion"
[66] "viral process"
[67] "regulation of neuron death"
[68] "positive regulation of cytosolic calcium ion concentration"
[69] "positive regulation of nitric oxide biosynthetic process"
[70] "cell junction assembly"
[71] "isoprenoid catabolic process"
[72] "polyketide metabolic process"
[73] "aminoglycoside antibiotic metabolic process"
[74] "epithelial fluid transport"
[75] "doxorubicin metabolic process"
[76] "positive regulation of collateral sprouting"
[77] "regulation of chemokine-mediated signaling pathway"
[78] "cellular response to caffeine"
[79] "alpha-beta T cell activation"
[80] "regulation of transforming growth factor beta production"
[81] "positive regulation of epithelial cell apoptotic process"
[82] "positive regulation of nitric oxide metabolic process"
[83] "ERK1 and ERK2 cascade"
[84] "positive regulation of leukocyte chemotaxis"
[85] "regulation of viral process"
[86] "positive regulation of leukocyte cell-cell adhesion"
[87] "negative regulation of immune system process"
[88] "positive regulation of mitotic nuclear division"
[89] "transforming growth factor beta production"
[90] "cellular response to biotic stimulus"
[91] "lactation"
[92] "positive regulation of muscle contraction"
[93] "regulation of systemic arterial blood pressure"
[94] "regulation of striated muscle cell differentiation"
[95] "regulation of synapse assembly"
[96] "positive regulation of signaling receptor activity"
[97] "negative regulation of secretion"
[98] "gland development"
[99] "histamine secretion"
[100] "regulation of transforming growth factor beta1 production"
[101] "nerve growth factor signaling pathway"
[102] "learning or memory"
[103] "cellular response to transforming growth factor beta stimulus"
[104] "regulation of steroid metabolic process"
[105] "cellular calcium ion homeostasis"
[106] "CD4-positive, alpha-beta T cell activation"
[107] "positive regulation of lymphocyte activation"
[108] "regulation of alpha-beta T cell activation"
[109] "regulation of cytosolic calcium ion concentration"
[110] "parturition"
[111] "negative regulation of myotube differentiation"
[112] "positive regulation of synaptic transmission, GABAergic"
[113] "transforming growth factor beta1 production"
[114] "positive regulation of epidermal growth factor-activated receptor activity"
[115] "positive regulation of T cell apoptotic process"
[116] "cellular response to purine-containing compound"
[117] "negative regulation of peptidase activity"
[118] "calcium ion homeostasis"
[119] "regulation of granulocyte chemotaxis"
[120] "response to virus"
[121] "neuron death"
[122] "sprouting angiogenesis"
[123] "positive regulation of response to tumor cell"
[124] "positive regulation of immune response to tumor cell"
[125] "negative regulation of leukocyte activation"
[126] "smooth muscle contraction"
[127] "blood vessel endothelial cell proliferation involved in sprouting angiogenesis"
[128] "regulation of blood pressure"
[129] "positive regulation of protein tyrosine kinase activity"
[130] "cellular response to fibroblast growth factor stimulus"
[131] "response to metal ion"
[132] "striated muscle cell differentiation"
[133] "positive regulation of fibroblast migration"
[134] "primary alcohol catabolic process"
[135] "sodium ion export across plasma membrane"
[136] "negative regulation of activated T cell proliferation"
[137] "histamine transport"
[138] "antiviral innate immune response"
[139] "positive regulation of MAPK cascade"
[140] "negative regulation of hydrolase activity"
[141] "memory"
[142] "positive regulation of peptidyl-tyrosine phosphorylation"
[143] "positive regulation of nuclear division"
[144] "positive regulation of synapse assembly"
[145] "regulation of body fluid levels"
[146] "positive regulation of lymphocyte apoptotic process"
[147] "regulation of neuron differentiation"
[148] "positive regulation of cell-cell adhesion"
barplot(ego, showCategory = 12)Earlier we presented 148 biological characteristics and pathways that were identified regarding the present genes in sample. However, we used the bar plot to express the same but based on the adjusted p value. Hence, as it can be seen, response to lipopolysaccharide, cellular reponse to fatty acids and response to molecule of origin, are the three characteristics connected with the highest values of adjusted p value, and therefore with the most up regulated genes.
Since the barplot is only displaying most significant enriched terms, we chose to use cneplot, which extracts complex associations (linkages of genes and biological concepts as a network), to provide more information, for example, which genes are associated with the terms.
cnetplot(ego, categorySize = "pvalue", foldChange = rownames(SigExpGenes))To better comprehend the data, we can generate a dataset of significantly expressed genes. Gene expression is the process by which the information encoded in a gene is used to drive the construction of a protein molecule, hence, it is important to know which genes are expressed.
SigExpGenes$ENTREZID<-AnnotationDbi::mapIds(x=database,
keys = rownames(SigExpGenes),
column = "ENTREZID",
keytype = "SYMBOL",
multiVals = "first")
SigExpGenes%>%headlog2 fold change (MLE): characteristics_ch1.1 treatment..Plated.on.Soft.get vs treatment..Plated.on.Hard.gel
Wald test p-value: characteristics ch1.1 treatment..Plated.on.Soft.get vs treatment..Plated.on.Hard.gel
DataFrame with 6 rows and 7 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
A2ML1-AS1 18.4058 -1.23924 0.404432 -3.06416 2.18284e-03 2.01325e-02
AC012513.6 45.5618 -1.26695 0.203626 -6.22198 4.90918e-10 5.09615e-08
ACPT 12.2710 -2.97056 0.764553 -3.88536 1.02179e-04 1.81780e-03
AKR1B10 79.0672 1.48311 0.310591 4.77511 1.79608e-06 6.40810e-05
AKR1C3 178.3155 1.35422 0.423084 3.20084 1.37029e-03 1.41720e-02
ALS2CL 44.5509 -1.00743 0.216645 -4.65013 3.31721e-06 1.08078e-04
ENTREZID
<character>
A2ML1-AS1 100874108
AC012513.6 NA
ACPT NA
AKR1B10 57016
AKR1C3 8644
ALS2CL 259173
kk <- enrichKEGG(gene = SigExpGenes$ENTREZID,
organism = 'hsa',
pAdjustMethod = "BH",
qvalueCutoff = 0.2,
pvalueCutoff = 0.05)
kkres<-as.data.frame(kk@result)As we can see below, significantly expressed genes are related to diseases hepatitis C and influenza A.
significantkkres<-kkres[kkres$pvalue<kk@pvalueCutoff & kkres$p.adjust<kk@pvalueCutoff & kkres$qvalue < kk@qvalueCutoff,2,drop=F]
significantkkres Description
hsa05160 Hepatitis C
hsa05164 Influenza A
links<-sapply(rownames(head(kk)),function(x){browseKEGG(kk, x)})
geneList<-SigExpGenes$log2FoldChange
names(geneList)<-SigExpGenes$ENTREZIDAnother important set of informations is related to a biological pathway in which the gene is included in. We can use the function pathview to visualize KEGG pathways. Hence, the following example illustrates how to visualize the hsa05160 or hsa05164 pathway, which was enriched in our previous analysis.
# hsa05160
hsa05160 <- pathview(gene.data = geneList,
pathway.id = "hsa05160",
species = "hsa",
limit = list(gene=max(abs(geneList)), cpd=1))
# hsa05164
hsa05164 <- pathview(gene.data = geneList,
pathway.id = "hsa05164",
species = "hsa",
limit = list(gene=max(abs(geneList)), cpd=1))[1] "hsa05160" "hsa05164"
[[1]]
[1] "/Users/ivana/Desktop/Bioinf21.22/FunctGenomics/project/pics"
[[2]]
[1] "/Users/ivana/Desktop/Bioinf21.22/FunctGenomics/project/pics"
We can see the significantly expressed genes represented in the pathway map in shades of green.
The tables of Ensembl data is downloadable with the BioMart data-mining tool so we are using BioMart to extract the data we need.
The used datasets can be seen on the table given, with the Ensembl version.
#mkk <- enrichMKEGG(gene = SigExpGenes$ENTREZID,
#organism = 'hsa',
#pAdjustMethod = "BH",
#qvalueCutoff = 0.2,
#pvalueCutoff = 0.05)
#knitr::kable(head(mkk))
#mkkres<-as.data.frame(mkk@result)
#listMarts() %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))#ensembl <- useMart("ensembl")
#datasets <- listDatasets(ensembl)
#ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
#filters = listFilters(ensembl)
#annotation1<-AnnotationDbi::select(x=database,
#keys = rownames(counts(dds)),
#column = c("ENTREZID","SYMBOL"),
#keytype = "SYMBOL",
#multiVals = "first")
#annotation2<-getBM(attributes=c("ensembl_gene_id","description","chromosome_name", 'entrezgene_id',"uniprot_gn_symbol","hgnc_symbol"),
# filters = "entrezgene_id",
#values = annotation1$ENTREZID,
# mart = ensembl)
#annotation2$entrezgene_id<-as.character(annotation2$entrezgene_id)
#annotation3<-annotation1%>%inner_join(annotation2, by=c('ENTREZID' = 'entrezgene_id'))We can see the dataset as given for each group:
counts(dds) %>% head(5) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))| SOFT_6h_1 | SOFT_6h_2 | SOFT_6h_3 | HARD_6h_1 | HARD_6h_2 | HARD_6h_3 | SOFT_12h_1 | SOFT_12h_2 | SOFT_12h_3 | HARD_12h_1 | HARD_12h_2 | HARD_12h_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5_8S_rRNA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5S_rRNA | 11 | 7 | 8 | 4 | 8 | 12 | 8 | 16 | 13 | 9 | 15 | 16 |
| 7SK | 6 | 5 | 2 | 4 | 14 | 7 | 17 | 7 | 4 | 0 | 8 | 6 |
| A1BG | 135 | 158 | 224 | 191 | 235 | 183 | 246 | 235 | 120 | 181 | 246 | 197 |
| A1BG-AS1 | 152 | 199 | 254 | 250 | 294 | 256 | 248 | 251 | 161 | 220 | 353 | 262 |
Now to create an understandable visualization we can use boxplots per group. First it can be seen on the left, boxplot 1, the plot is created by the boxplot function in R. In order to see any differences for the depth normalization, we also created the boxplot on the right side, for normalized counts.
The simplest approach to quantifying gene expression by RNA-seq is to count the number of reads that map (align) to each gene (read count).
As it can be seen below, usually genes pleated on hard gel have higher read counts than those on soft gel. However, we can’t really see anything because the range of the read counts is so large that it covers several orders of magnitude.
boxplot(counts(dds), main = "read counts only", cex = .6)Normalization is the process of scaling raw count values to account for the “uninteresting” factors. In this way the expression levels are more comparable between and/or within samples. As we can see, when non normalized, read counts are not aligned and are harder to understand.
boxplot(log2(counts(dds)+1), notch=TRUE,
main = "Non-normalized read counts",
ylab="log2(read counts)", cex = .6)After applying normalization, we can say that the two groups slighty differ, since usually genes pleated on hard gel have higher number of read counts than those on soft gel.
dds <- readRDS("dds.rds")
dds <- DESeq(dds)
counts.sf_normalized <- counts(dds, normalized=TRUE)
boxplot(counts.sf_normalized, main = "Normalized", cex = .6)The size factor for each cell represents the estimate of the relative bias in that cell, so division of its counts by its size factor should remove that bias.
boxplot(log2(counts(dds, normalize= TRUE) +1), notch=TRUE,
main = "Size-factor-normalized read counts",
ylab="log2(read counts)", cex = .6)vsd<-vst(dds)
df2<-as.data.frame(assay(vsd))
ggplot(data = melt(df2), aes(x=variable, y=value)) + geom_boxplot(aes(fill=variable))We can also use dispersion plot in our data to understand per-gene dispersion estimates together with the fitted mean-dispersion relationship. We can draw the plots by the function plotDispEsts in R.
plotDispEsts(dds)The DESeq2 dispersion estimates are inversely related to the mean and directly related to variance. Based on this relationship, the dispersion is higher for small mean counts and lower for large mean counts. The dispersion estimates for genes with the same mean will differ only based on their variance. Therefore, the dispersion estimates reflect the variance in gene expression for a given mean value. This red displayed curve plots the estimate for the expected dispersion value for genes of a given expression strength. Each black dot is a gene with an associated mean expression level and maximum likelihood estimation (MLE) of the dispersion. ### Rlog Normalization
If we use the rlog normalization in our dataset, we can see the effects on the data by creating another boxplot, clustred by the group, LoxP and wild type (WT).
We can see the first entries in our dataset as given below:
rld <- rlog(dds)
assay(rld)%>%head(3) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))| SOFT_6h_1 | SOFT_6h_2 | SOFT_6h_3 | HARD_6h_1 | HARD_6h_2 | HARD_6h_3 | SOFT_12h_1 | SOFT_12h_2 | SOFT_12h_3 | HARD_12h_1 | HARD_12h_2 | HARD_12h_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5_8S_rRNA | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 5S_rRNA | 3.421258 | 3.342352 | 3.348105 | 3.301041 | 3.331405 | 3.385469 | 3.323705 | 3.437823 | 3.424232 | 3.348899 | 3.386827 | 3.417254 |
| 7SK | 2.683653 | 2.654686 | 2.606460 | 2.641470 | 2.746321 | 2.667626 | 2.768296 | 2.670672 | 2.640090 | 2.574192 | 2.660159 | 2.647789 |
We can see in the boxplot that the data is coloured based on the group cluster.
The mean values differ between the groups but not by a huge difference, rather by wiggling.
There are outliers in each group, and the maximum value is similar nearly in all of them by minor differences.
df3<-as.data.frame(assay(vsd))
ggplot(data = melt(df3), aes(x=variable, y=value)) + geom_boxplot(aes(fill=variable))